Intro

The objective of this project is to predict Airbnb rental prices based on 90 features. These features span various data types, so data cleaning and pre-processing are required before modeling. The results will be evaluated with RMSE; the lower the score, the better the model predicts.
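For reference, RMSE (root mean squared error) can be computed with a small helper like the one below. This is a minimal base-R sketch of the metric; the competition's own scorer is authoritative.

```r
# Root mean squared error: square root of the mean squared
# difference between actual and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Illustration with made-up numbers (each prediction is off by 10):
rmse(c(100, 150, 200), c(110, 140, 190))  # -> 10
```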

Initiating R Environment

The following code initiates the R environment with the required packages and helper functions. Since the project will build an artificial neural network, TensorFlow and Keras are imported via Miniconda, which will facilitate any GPU computation the network needs. The following code also sets constant values for the entire project, including the seed and the number of cores.

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 458072 24.5     989307 52.9   638940 34.2
## Vcells 833298  6.4    8388608 64.0  1635956 12.5
## Command for sourcing the URL:
##   downloader::source_url("https://raw.githubusercontent.com/DMinghao/Analysis_Pocketknife/main/R/init_env.R", sha="9e657bc82974025a094b66de35cb78eb826621ab")
## Hash 9e657bc82974025a094b66de35cb78eb826621ab matches expected value.
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
## The following object is masked from 'package:plotly':
## 
##     slice
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:tensorflow':
## 
##     train
## The following object is masked from 'package:purrr':
## 
##     lift
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## 
## Attaching package: 'sylly'
## The following object is masked from 'package:psych':
## 
##     describe
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Attaching package: 'tm'
## The following object is masked from 'package:koRpus':
## 
##     readTagged
## Loading required package: ParamHelpers
## Loading required package: checkmate
## Loading required package: mlr
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
## 
## Attaching package: 'mlr'
## The following object is masked from 'package:caret':
## 
##     train
## The following object is masked from 'package:tensorflow':
## 
##     train
## The following object is masked from 'package:e1071':
## 
##     impute
## 
## Attaching package: 'DiceKriging'
## The following object is masked from 'package:checkmate':
## 
##     checkNames
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:fpc':
## 
##     dbscan
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## [1] "Miniconda is already installed. "
## [1] "r_tf_gpu  is already installed. "

Read Provided Data

The provided data is broken into two separate files: a training set, which contains a price column, and a scoring set that does not have the price column.

writeSubmit <- function(pred) {
  submissionFile = data.frame(id = scoringData$id, price = pred)
  write.csv(submissionFile, 'submission.csv', row.names = F)
}

rawData <- read_csv('./input/rentlala2021/analysisData.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   name = col_character(),
##   summary = col_character(),
##   space = col_character(),
##   description = col_character(),
##   neighborhood_overview = col_character(),
##   notes = col_character(),
##   transit = col_character(),
##   access = col_character(),
##   interaction = col_character(),
##   house_rules = col_character(),
##   host_name = col_character(),
##   host_since = col_date(format = ""),
##   host_location = col_character(),
##   host_about = col_character(),
##   host_response_time = col_character(),
##   host_response_rate = col_character(),
##   host_acceptance_rate = col_character(),
##   host_is_superhost = col_logical(),
##   host_neighbourhood = col_character(),
##   host_verifications = col_character()
##   # ... with 29 more columns
## )
## i Use `spec()` for the full column specifications.
scoringData <- read_csv('./input/rentlala2021/scoringData.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   name = col_character(),
##   summary = col_character(),
##   space = col_character(),
##   description = col_character(),
##   neighborhood_overview = col_character(),
##   notes = col_character(),
##   transit = col_character(),
##   access = col_character(),
##   interaction = col_character(),
##   house_rules = col_character(),
##   host_name = col_character(),
##   host_since = col_date(format = ""),
##   host_location = col_character(),
##   host_about = col_character(),
##   host_response_time = col_character(),
##   host_response_rate = col_character(),
##   host_acceptance_rate = col_character(),
##   host_is_superhost = col_logical(),
##   host_neighbourhood = col_character(),
##   host_verifications = col_character()
##   # ... with 30 more columns
## )
## i Use `spec()` for the full column specifications.

Correcting column data type

By doing some column comparison, we can see that the two data sets contain columns that do not match each other. Thus, some preliminary housekeeping is needed to keep all columns in the two data sets in sync.

compare_df_cols(rawData, scoringData) %>%
  filter(rawData != scoringData |
           rawData %>% is.na() | scoringData %>% is.na())
##   column_name rawData scoringData
## 1       price numeric        <NA>
## 2     zipcode numeric   character
rawData %>% mutate(license = license %>% as.logical()) -> rawData
scoringData  %>% mutate(zipcode = zipcode %>% as.numeric()) -> scoringData
## Warning in zipcode %>% as.numeric(): NAs introduced by coercion
compare_df_cols(rawData, scoringData) %>%
  filter(rawData != scoringData | rawData %>% is.na() | scoringData %>% is.na())
##   column_name rawData scoringData
## 1       price numeric        <NA>

Merging scoring data and training data

The following code creates a price column for the scoring data so that the two data sets can be bound together into one for the upcoming data pre-processing and cleaning.

pricedScoringData <- scoringData %>% mutate(price = -1)

allData <- rawData %>% bind_rows(pricedScoringData) 

Pre-Processing Data

Most models cannot handle missing data, character data, or unscaled data, so it is necessary to clean and preprocess the data before moving forward. Beyond that, it might be helpful to do some feature engineering so that models have more information to work with. However, since most models tend to run longer and perform worse as more features are added, feature selection will be needed to improve model performance and/or accuracy.

Grouping Columns with Same Data Type

Given that there are different data types in the data set, bulk-processing columns with the same data type might speed things up. The following code groups columns of the same data type together to help with later data cleaning.

# Extract ID column 
extractID <- allData %>% select(id)

allData <- allData %>% select(-id)

# Put similar columns into baskets 
numericCols <- allData %>% select(is.numeric) %>% colnames()
boolCols <- allData %>% select(is.logical) %>% colnames()
dateCols <- allData %>% select(is.Date) %>% colnames()
charCols <- allData %>% select(is.character) %>% colnames()
longCharCols <-
  allData[charCols] %>% 
  select(c(name, 
      summary, 
      space, 
      description, 
      neighborhood_overview, 
      notes, 
      transit, 
      access, 
      interaction, 
      house_rules, 
      host_about
  )) %>% 
  colnames()
factorCharCols <- 
  allData[charCols] %>% 
  select(-c(name, 
      summary, 
      space, 
      description, 
      neighborhood_overview, 
      notes, 
      transit, 
      access, 
      interaction, 
      house_rules, 
      host_about
  )) %>% 
  select(-c(
    host_name, 
    host_verifications, 
    host_response_time,
    calendar_updated, 
    host_response_rate, 
    host_acceptance_rate, 
    amenities
  )) %>% 
  colnames()

rateCols <- allData %>% 
  select(c(host_response_rate, host_acceptance_rate)) %>% 
  colnames()

# Encode all text columns to UTF-8 
allData[charCols] <- allData[charCols] %>% mutate_all(funs(enc2utf8(.)))

Data Wrangling and Cleaning

Having the data in various formats and types prevents us from extracting information from it. Hence, the following section transforms some columns so that they are easier to work with.

Zip code column

Given that some zip codes are missing in the data set, we can fill in those data points with the most common zip code in the corresponding area.

mostCommonZip <- allData %>% 
  select(c(neighbourhood_cleansed, neighbourhood_group_cleansed, city, zipcode)) %>% 
  group_by(neighbourhood_cleansed, zipcode) %>% 
  summarise(count = n()) %>% 
  filter(count == max(count)) %>% 
  ungroup()
## `summarise()` has grouped output by 'neighbourhood_cleansed'. You can override using the `.groups` argument.
getZip <- function(nbhd){
  # The argument must not share the column's name, or filter() would compare
  # the column with itself (always TRUE); take the first zipcode in case of ties
  mostCommonZip %>% filter(neighbourhood_cleansed == nbhd) %>% pull(zipcode) %>% first()
}

allData <- allData %>%
  rowwise() %>%
  mutate(zipcode = ifelse(is.na(zipcode), getZip(neighbourhood_cleansed), zipcode)) %>%
  ungroup() %>%
  mutate(zipcode = zipcode %>% as.factor())

Rate columns

Some text columns are in percentage format. We can assume that all NA values in these columns are 0 since, in a real-world scenario, it is reasonable that these NAs are derived from dividing by 0. The remaining data points can be parsed to numeric.

allData[rateCols] <- allData[rateCols] %>% 
  mutate(host_response_rate = gsub("N/A","0%", host_response_rate)) %>% 
  mutate(host_acceptance_rate = gsub("N/A","0%", host_acceptance_rate)) %>% 
  mutate(host_response_rate = gsub("%","", host_response_rate)) %>% 
  mutate(host_acceptance_rate = gsub("%","", host_acceptance_rate)) %>% 
  mutate_all(as.numeric) %>% 
  mutate_each(funs(./100)) %>% 
  replace(is.na(.),0)
## Warning: `mutate_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.

Handle non-char columns NA value

The following code handles the NA values in the square_feet, weekly_price, and monthly_price columns. It might not be the best way, but setting these values to 0 is safe. One alternative would be MICE imputation, but since the number of missing values is too large, it would distort the distribution of these columns. Another approach would be to train models that predict these values.

allData[boolCols] <- allData[boolCols] %>% replace(is.na(.),F)

reservedNumCols <- allData %>% select(c(square_feet, weekly_price, monthly_price)) %>% colnames()

allData[numericCols[numericCols %!in% reservedNumCols]] <- 
  allData[numericCols[numericCols %!in% reservedNumCols]] %>% replace(is.na(.),0)

allData <- allData %>% mutate(square_feet = ifelse(is.na(square_feet), 0, square_feet))
allData <- allData %>% mutate(weekly_price = ifelse(is.na(weekly_price), 0, weekly_price))
allData <- allData %>% mutate(monthly_price = ifelse(is.na(monthly_price), 0, monthly_price))
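The `%!in%` operator used above comes from the sourced helper script; it is presumably just the negation of `%in%`, along these lines:

```r
# Negated %in%: TRUE for elements of x that are NOT found in table.
`%!in%` <- function(x, table) !(x %in% table)

c("a", "b", "c") %!in% c("b")  # -> TRUE FALSE TRUE
```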

Factor columns

Several text columns can be converted to factors. Some of them have only one level, so they should be dropped. On the other hand, I decided to keep columns with a large number of levels as well, but it would be wise to downsize these factors by grouping low-frequency levels into an "other" category.

allData[factorCharCols] <- allData[factorCharCols] %>%
  replace(is.na(.), "N/A") %>%
  mutate_all(as.factor)

# Exclude 1 level factor columns 
allData <- allData %>% select(-c(country_code, country, state, market)) 
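The low-frequency grouping mentioned above is not implemented here, but it could be sketched in base R as follows (or more idiomatically with `forcats::fct_lump_min()`); the threshold of 10 is an arbitrary illustration, not a value used in this project:

```r
# Pool factor levels occurring fewer than `min_n` times into an "other" level.
lump_rare_levels <- function(f, min_n = 10) {
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  # Assigning the same label to several levels merges them in R
  levels(f)[levels(f) %in% rare] <- "other"
  f
}

x <- factor(c(rep("NY", 12), rep("LA", 11), "SF", "BOS"))
table(lump_rare_levels(x))  # NY and LA kept; SF and BOS pooled into "other"
```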

Host verification and amenities

The host verification and amenities columns are transformed into one-hot dummy columns, one per unique item. This allows models to process the information in these two columns more easily. The original columns are dropped after processing.

# Create verification count column 
allData <- allData %>% 
  mutate(host_verifications = gsub("\\[|\\]|\\'|\\,", "", host_verifications)) %>% 
  mutate(vari_count = strsplit(host_verifications, " ") %>% lengths())

vari_list <- allData %>% 
  select(host_verifications) %>% 
  lapply(function(x) unique(trimws(unlist(strsplit(x, " ")))))

for(w in 1:length(vari_list[[1]])) {
  new <- grepl(pattern = vari_list[[1]][w], x = allData$host_verifications, fixed = TRUE)
  allData[paste(vari_list[[1]][w], "_vari")] <- new
}

# Create amenities count column 
allData <- allData %>% 
  mutate(amenities_count = strsplit(amenities, ",") %>% lengths()) 

amen_list <- allData %>% 
  select(amenities) %>% 
  mutate(amenities = gsub("\\.", "", amenities)) %>% 
  lapply(function(x) unique(trimws(unlist(strsplit(x, ",")))))

for(w in 1:length(amen_list[[1]])) {
  new <- grepl(pattern = amen_list[[1]][w], x = allData$amenities, fixed = TRUE)
  allData[paste(amen_list[[1]][w], "_amen")] <- new
}

# discard original column 
allData <- allData %>% select(-c(amenities, host_verifications)) %>% clean_names()

Duration columns

The duration columns are messy: they are all in text format and do not have a consistent step size. Therefore, after stripping all text components from these columns, a column-specific step size is chosen to rescale the durations.

allData <- allData %>%
  mutate(host_response_time = gsub("within a ", "", host_response_time)) %>%
  mutate(host_response_time = gsub("within an ", "", host_response_time)) %>%
  mutate(host_response_time = gsub("few hours", "12", host_response_time)) %>%
  mutate(host_response_time = gsub("hour", "1", host_response_time)) %>%
  mutate(host_response_time = gsub("a few days or more", "48", host_response_time)) %>%
  mutate(host_response_time = gsub("day", "24", host_response_time)) %>%
  mutate(host_response_time = replace_na(host_response_time, "N/A")) %>% 
  mutate(host_response_time = gsub("N/A", "96", host_response_time)) %>%
  mutate(host_response_time = as.numeric(host_response_time))

allData <- allData %>% 
  mutate(calendar_updated = gsub(" ago", "", calendar_updated)) %>% 
  mutate(calendar_updated = gsub("today", "0", calendar_updated)) %>% 
  mutate(calendar_updated = gsub("yesterday", "1", calendar_updated)) %>% 
  mutate(calendar_updated = case_when(
    grepl("days", calendar_updated) ~ as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated)) %>% as.character(), 
    grepl("weeks", calendar_updated) ~ as.character(as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated))*7),
    grepl("months", calendar_updated) ~ as.character(as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated))*30), 
    grepl("a week", calendar_updated) ~ "7",
    grepl("week", calendar_updated) ~ "7",
    grepl("never", calendar_updated) ~ "3000",
    TRUE ~ as.character(calendar_updated)
  )) %>% 
  mutate(calendar_updated = as.numeric(calendar_updated))
## Warning in as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated)) %>% : NAs
## introduced by coercion
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

Date columns

Some models cannot process date columns, so we need to transform them to numeric. Using the latest date as 0, we calculate every other date's "distance" to the most recent date to get a consistent numeric transformation.

allData[dateCols] <- allData %>% 
  select(dateCols) %>% 
  mutate_all(funs(max(., na.rm = TRUE) - .)) %>% 
  mutate_all(as.numeric)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(dateCols)` instead of `dateCols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

Feature Engineering

Having 90 columns is good, but why not more? Some models will benefit from having more features, especially meaningful features and features that were not derived from other numeric features.

Mean price for areas

If I were a user on the platform needing to price my property, it would be helpful to know the average price in my area. The same applies to the artificial stupidity: providing an average price for each area is like helping it cheat on a test, and the result might be phenomenal. The following code adds three new columns to the data set: the mean price for the neighbourhood, for the neighbourhood group, and for the zip code.

mean_price <- allData %>% 
  filter(price > -1) %>% 
  group_by(neighbourhood_cleansed = neighbourhood_cleansed) %>%
    summarize(record_count_c = n(), 
              price_mean_c = mean(price))

allData <- allData %>% left_join(mean_price, by = c("neighbourhood_cleansed" = "neighbourhood_cleansed"))
allData[is.na(allData$price_mean_c),]$price_mean_c = mean(allData[allData$price > -1, ]$price_mean_c, na.rm = TRUE)

mean_price2 <- allData %>% 
  filter(price > -1) %>% 
  group_by(neighbourhood_group_cleansed = neighbourhood_group_cleansed) %>%
  summarize(price_mean_ngc = mean(price))

allData <- allData %>% left_join(mean_price2, by = c("neighbourhood_group_cleansed" = "neighbourhood_group_cleansed"))

mean_price3 <- allData %>% 
  filter(price > -1) %>% 
  group_by(zipcode = zipcode) %>%
  summarize(price_mean_zip = mean(price))

allData <- allData %>% left_join(mean_price3, by = c("zipcode" = "zipcode"))
allData[is.na(allData$price_mean_zip),]$price_mean_zip = mean(allData[allData$price > -1, ]$price_mean_zip, na.rm = TRUE)

Host Gender

This new feature might be excessive, but I just can't help myself. Knowing a host's name gives us the opportunity to guess their gender. It might not be something that influences the price, but who knows, maybe male hosts tend to price their properties higher than female hosts do.

allData %>% select(host_name) %>% c() -> names 

allData <- allData %>%
  left_join(gender(names$host_name) %>%
              distinct() %>%
              select(c(name, proportion_male, proportion_female)),
            by = c("host_name" = "name"))

Zip code coordinate

The zip code column by itself might not be helpful, since the codes are just numbers without meaning. Therefore, adding each zip code's coordinates might be useful. In a real-world scenario, it is safe to assume that properties in some areas are priced higher than in others, but it's up to the model to find out.

ZipCodes <- read.table(unz("US.zip","US.txt"), sep="\t")
names(ZipCodes) = c("CountryCode", "zip", "PlaceName",
                    "AdminName1", "AdminCode1", "AdminName2", "AdminCode2",
                    "AdminName3", "AdminCode3", "latitude", "longitude", "accuracy")
ZipCodes <- ZipCodes %>% mutate(zip = as.factor(zip))
allData <- allData %>% 
  left_join(
    ZipCodes %>% 
      select(c(zip, 
               PlaceName, 
               AdminName2, 
               latitude, 
               longitude)), by = c("zipcode"="zip")
  )

allData <- allData %>% mutate(PlaceName = PlaceName %>% as.factor(), 
                              AdminName2 = AdminName2 %>% as.factor())

Text mining

The fun part is that all these "useless" long text columns can also be transformed into numeric columns. The textfeatures function generates columns like word count, URL count, exclamation mark count, etc. Even better, it also performs sentiment analysis and LDA topic clustering on these text columns. Are these columns useful? It's up to the model to find out.

allData[longCharCols] <- allData[longCharCols] %>% replace(is.na(.), "")

allText <- NULL
for(col in longCharCols) allText <- paste(allText, " ", allData[[col]])

tex_feat <- textfeatures(allText)
## ↪ Counting features in text...
## ↪ Sentiment analysis...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ↪ Normalizing data
## ✔ Job's done!
allData <- allData %>% bind_cols(tex_feat)

Create 2nd and 3rd power columns

Assuming that not all features have a linear relationship with the price, raising all numeric features to the 2nd and 3rd power might help the model fit better.

copyData <- allData

allData <- allData %>% cbind( copyData %>%
  select(where(is.numeric)) %>%
  select(-price) %>%
  mutate_all(function(x) x^2) %>%
  setNames(paste0(names(.), "_2pow")))
allData <- allData %>% cbind( copyData %>%
  select(where(is.numeric)) %>%
  select(-price) %>%
  mutate_all(function(x) x^3) %>%
  setNames(paste0(names(.), "_3pow")))

Scale columns

Scaling is not necessary for some models, and not all models benefit from scaled data, but here we are, why not.

numCols <- allData %>% select(is.numeric) %>% colnames()
facCols <- allData %>% select(is.factor) %>% colnames()
boolCols <- allData %>% select(is.logical) %>% colnames()

allData[numCols] <- allData[numCols] %>% 
  replace(is.na(.),0) %>% 
  mutate_at(vars(-price), funs(scale))

allData[facCols] <- allData[facCols] %>%
  mutate_all(funs(as.numeric(.)-1))

allData[boolCols] <- allData[boolCols] %>% 
  mutate_all(funs(as.numeric(.)))

Remove 0 variance columns

Zero-variance features are meaningless; it's wise to drop them.

remove0var <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))
    want <- which(!out > 1)
    name <- unlist(want) %>% names()
    print(name)
    dat %>% select(-all_of(name))
}

allData <- allData %>% remove0var()
## [1] "has_availability"                             
## [2] "requires_license"                             
## [3] "license"                                      
## [4] "jurisdiction_names"                           
## [5] "is_business_travel_ready"                     
## [6] "translation_missing_enhosting_amenity_50_amen"
## [7] "translation_missing_enhosting_amenity_49_amen"

Clustering Data

Considering that rent prices can be classified into high, medium, and low price ranges, would it be helpful if the model also knew which price range a property falls into? With this in mind, clustering might segregate the data into different price ranges. The following code generates 15 clusters for the data set. Due to the time limit, I used the simplest clustering method, k-means, and hand-picked the number of clusters. However, HDBSCAN would be a better choice of clustering method in this case. (Too bad it runs too long and does not support parallel processing.)

beforeCluster <- allData 

allData <- beforeCluster %>% select(!is.character)

wss <- (nrow(allData)-1)*sum(apply(allData,2,var))
for (i in 2:20) { # increase for better result 
  set.seed(seed)
  # print(i)
  clu <- kmeans(allData %>% select(-price), centers=i)
  wss[i] <- sum(clu$withinss) 
  }
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

set.seed(seed)
kc <- kmeans(allData, centers=15)

beforeCluster %>% mutate(cluster = kc$cluster %>% as.factor()) -> afterCluster

Generate mean price for clusters

As with the mean price for areas, having the mean price for each cluster might also help the model find a more appropriate price.

mean_price_cluster <- afterCluster %>% 
  filter(price > -1) %>% 
  group_by(cluster = cluster) %>%
  summarize(price_mean_clus = mean(price))

afterCluster <- afterCluster %>% left_join(mean_price_cluster, by = c("cluster" = "cluster"))

# plotting each cluster's price distribution 
pp <- afterCluster %>% select(c(price, cluster)) %>% ggplot(aes(x = price, color = cluster)) + geom_density()
pp %>% ggplotly()
allData <- afterCluster %>% mutate(cluster = cluster %>% as.numeric())

Prepare for modeling

After all the data has been cleaned, pre-processed, and transformed, we need to prepare it for modeling.

Separate Train and Test

The training set and testing set are separated.

trainData <- allData %>%
    select(!is.character) %>% 
    filter(price > -1) 
testData <- allData %>%
    select(!is.character) %>% 
    filter(price == -1) %>% select(-price)

Feature selection

To lighten the load on some models, we can perform feature selection on the data set. Some models might even perform better with fewer features. The following code uses the Boruta algorithm, a tree-based method, to calculate feature importance.

boruta_output <-
  Boruta(
    price ~ .,
    data = trainData,
    pValue = 0.05,
    maxRuns = 500,
    doTrace = 2,
    getImp = getImpXgboost, # delete this line for better result, keep it for faster output
    nthread=cores,
  )
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
##  11. run of importance source...
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
##  15. run of importance source...
## After 15 iterations, +3.2 mins:
##  confirmed 14 attributes: accommodates, bathrooms, bedrooms, cleaning_fee, cluster and 9 more;
##  rejected 1024 attributes: accessible_height_bed_amen, accessible_height_toilet_amen, accommodates_2pow, accommodates_3pow, AdminName2 and 1019 more;
##  still have 26 attributes left.
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
## After 19 iterations, +3.2 mins:
##  confirmed 1 attribute: availability_365;
##  rejected 4 attributes: price_mean_ngc, w125, w139_2pow, w56;
##  still have 21 attributes left.
##  20. run of importance source...
##  21. run of importance source...
##  22. run of importance source...
##  23. run of importance source...
## After 23 iterations, +3.3 mins:
##  rejected 1 attribute: zipcode;
##  still have 20 attributes left.
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
## After 26 iterations, +3.3 mins:
##  rejected 1 attribute: n_nonasciis;
##  still have 19 attributes left.
##  27. run of importance source...
##  28. run of importance source...
##  29. run of importance source...
##  30. run of importance source...
## After 30 iterations, +3.3 mins:
##  confirmed 1 attribute: w50;
##  still have 18 attributes left.
##  31. run of importance source...
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
## After 39 iterations, +3.4 mins:
##  confirmed 1 attribute: minimum_nights;
##  rejected 1 attribute: host_since;
##  still have 16 attributes left.
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
## After 42 iterations, +3.4 mins:
##  rejected 1 attribute: calculated_host_listings_count_entire_homes;
##  still have 15 attributes left.
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  ... (repetitive " N. run of importance source..." progress lines trimmed)
## After 69 iterations, +3.6 mins:
##  confirmed 1 attribute: minimum_nights_avg_ntm;
##  still have 14 attributes left.
## After 75 iterations, +3.6 mins:
##  confirmed 1 attribute: availability_60;
##  still have 13 attributes left.
## After 87 iterations, +3.7 mins:
##  confirmed 1 attribute: calculated_host_listings_count;
##  still have 12 attributes left.
## After 90 iterations, +3.7 mins:
##  confirmed 1 attribute: first_review;
##  still have 11 attributes left.
## After 95 iterations, +3.7 mins:
##  confirmed 1 attribute: n_commas;
##  still have 10 attributes left.
## After 100 iterations, +3.8 mins:
##  confirmed 1 attribute: n_charsperword_2pow;
##  still have 9 attributes left.
## After 125 iterations, +3.9 mins:
##  confirmed 1 attribute: reviews_per_month;
##  still have 8 attributes left.
## After 161 iterations, +4.2 mins:
##  rejected 1 attribute: elevator_amen;
##  still have 7 attributes left.
## After 163 iterations, +4.2 mins:
##  rejected 1 attribute: building_staff_amen;
##  still have 6 attributes left.
## After 175 iterations, +4.3 mins:
##  confirmed 1 attribute: w194;
##  still have 5 attributes left.
## After 182 iterations, +4.3 mins:
##  confirmed 1 attribute: record_count_c;
##  still have 4 attributes left.
## After 272 iterations, +4.8 mins:
##  confirmed 1 attribute: security_deposit;
##  still have 3 attributes left.
## After 287 iterations, +4.9 mins:
##  confirmed 1 attribute: w160;
##  still have 2 attributes left.
##  ... (the search continued through iteration 499 with no further decisions)
boruta_dec <- attStats(boruta_output) %>% rownames_to_column()
boruta_dec[boruta_dec$decision!="Rejected",]
##                             rowname      meanImp    medianImp       minImp
## 2                     host_location 0.0033427998 0.0033617345 3.047511e-03
## 7                host_neighbourhood 0.0021625261 0.0021401783 1.557652e-03
## 12                           street 0.0021153634 0.0020174007 1.485645e-03
## 21                        room_type 0.1959787970 0.1959818572 1.938758e-01
## 22                     accommodates 0.0227605405 0.0227393011 2.186984e-02
## 23                        bathrooms 0.0216957316 0.0216981411 2.042193e-02
## 24                         bedrooms 0.0198406905 0.0197150066 1.900117e-02
## 30                 security_deposit 0.0006128238 0.0006154920 2.624091e-04
## 31                     cleaning_fee 0.0129010121 0.0129065781 1.188153e-02
## 32                  guests_included 0.0034149188 0.0033952619 2.412445e-03
## 34                   minimum_nights 0.0007221129 0.0006862727 5.258360e-04
## 40           minimum_nights_avg_ntm 0.0007946778 0.0008194881 0.000000e+00
## 44                  availability_60 0.0007950922 0.0006925343 2.701578e-04
## 46                 availability_365 0.0017642350 0.0017577582 1.187772e-03
## 49                     first_review 0.0006576752 0.0006347074 0.000000e+00
## 62   calculated_host_listings_count 0.0007851487 0.0007956137 9.578249e-05
## 66                reviews_per_month 0.0006977084 0.0006674355 3.775158e-04
## 157    pack_n_play_travel_crib_amen 0.0005587213 0.0004811592 4.767161e-04
## 239                  record_count_c 0.0005679868 0.0005416404 4.351942e-04
## 240                    price_mean_c 0.0252872969 0.0252895704 2.381056e-02
## 242                  price_mean_zip 0.0175870231 0.0175826144 1.690098e-02
## 257                        n_commas 0.0007293119 0.0007196047 4.267457e-04
## 332                             w50 0.0009930672 0.0009152918 5.420321e-04
## 409                            w127 0.0049106311 0.0048791081 4.269551e-03
## 442                            w160 0.0005748175 0.0005504995 0.000000e+00
## 476                            w194 0.0006005853 0.0005699136 2.803771e-05
## 560             n_charsperword_2pow 0.0007917217 0.0007619573 0.000000e+00
## 735                       w163_2pow 0.0005090496 0.0004929428 0.000000e+00
## 1063                        cluster 0.0257475562 0.0257783355 2.471778e-02
## 1064                price_mean_clus 0.6268139325 0.6269105073 6.223772e-01
##            maxImp  normHits  decision
## 2    0.0039837437 1.0000000 Confirmed
## 7    0.0032626345 1.0000000 Confirmed
## 12   0.0026547337 1.0000000 Confirmed
## 21   0.1967295665 1.0000000 Confirmed
## 22   0.0231964287 1.0000000 Confirmed
## 23   0.0225098318 1.0000000 Confirmed
## 24   0.0205361218 1.0000000 Confirmed
## 30   0.0011427048 0.6492986 Confirmed
## 31   0.0139397632 1.0000000 Confirmed
## 32   0.0039202326 1.0000000 Confirmed
## 34   0.0014344560 0.8396794 Confirmed
## 40   0.0014033275 0.8016032 Confirmed
## 44   0.0013389270 0.8356713 Confirmed
## 46   0.0024376726 0.9979960 Confirmed
## 49   0.0012600224 0.6933868 Confirmed
## 62   0.0015102141 0.8296593 Confirmed
## 66   0.0014420813 0.7434870 Confirmed
## 157  0.0008670325 0.5711423 Tentative
## 239  0.0009161837 0.6392786 Confirmed
## 240  0.0261493500 1.0000000 Confirmed
## 242  0.0185064882 1.0000000 Confirmed
## 257  0.0011875958 0.7795591 Confirmed
## 332  0.0017583969 0.9498998 Confirmed
## 409  0.0054291993 1.0000000 Confirmed
## 442  0.0012786424 0.6152305 Confirmed
## 476  0.0011192599 0.6452906 Confirmed
## 560  0.0017780480 0.7615230 Confirmed
## 735  0.0011220953 0.4969940 Tentative
## 1063 0.0258560092 1.0000000 Confirmed
## 1064 0.6283600051 1.0000000 Confirmed
selectedCols <- boruta_output %>% getSelectedAttributes(withTentative = TRUE)


trainData <- trainData %>% select(c(price, selectedCols))
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(selectedCols)` instead of `selectedCols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
testData <- testData %>% select(selectedCols)
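As the tidyselect message suggests, wrapping an external character vector in `all_of()` makes the selection explicit and warning-free. A minimal self-contained illustration (toy data, not the project's columns):

```r
library(dplyr)

df <- tibble::tibble(price = c(100, 80), beds = c(2, 1), noise = c(0.3, 0.9))
keep <- c("beds")

# all_of() treats `keep` strictly as column names and errors on typos,
# rather than silently matching a data column that happens to share the name
df %>% select(price, all_of(keep))
```

The same pattern applies to `selectedCols` in the selection step above.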

Save to file (if needed)

if (saveProcessedData) {
  
  write.csv(trainData,
            file("processedTrainData.csv",encoding="UTF-8"),
            row.names = FALSE)
  write.csv(testData,
            file("processedTestData.csv",encoding="UTF-8"),
            row.names = FALSE)
}

Modeling

This is the exciting part: we will fit three models to the training data and see which one performs best.
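All models are compared on RMSE, so a small helper keeps the comparison consistent (a sketch; this function is not part of the original helper scripts):

```r
# Root mean squared error: lower means better predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(100, 150, 200), c(110, 140, 190))  # 10
```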

Linear Regression

This is a baseline model; there is really nothing to see here (the summary output is eye-burning).

linear <- lm(price~., data = trainData)

summary(linear)
## 
## Call:
## lm(formula = price ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -353.55  -27.23   -4.94   20.58  683.87 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.111e+01  1.238e+00  49.374  < 2e-16 ***
## host_location                  -2.225e-03  9.244e-04  -2.407 0.016101 *  
## host_neighbourhood             -1.066e-02  2.123e-03  -5.023 5.10e-07 ***
## street                         -7.309e-03  3.710e-03  -1.970 0.048859 *  
## room_type                      -1.977e+01  3.344e-01 -59.127  < 2e-16 ***
## accommodates                    1.447e+01  4.497e-01  32.189  < 2e-16 ***
## bathrooms                       5.382e+00  2.993e-01  17.981  < 2e-16 ***
## bedrooms                        1.122e+01  3.760e-01  29.838  < 2e-16 ***
## security_deposit               -1.534e-01  2.881e-01  -0.533 0.594346    
## cleaning_fee                    6.693e+00  3.515e-01  19.040  < 2e-16 ***
## guests_included                 3.265e+00  3.357e-01   9.726  < 2e-16 ***
## minimum_nights                 -5.504e+00  3.725e-01 -14.779  < 2e-16 ***
## minimum_nights_avg_ntm          1.565e+00  3.738e-01   4.186 2.85e-05 ***
## availability_60                 4.640e+00  3.686e-01  12.586  < 2e-16 ***
## availability_365                1.367e+00  3.738e-01   3.658 0.000255 ***
## first_review                   -3.232e-01  2.856e-01  -1.132 0.257805    
## calculated_host_listings_count -3.862e+00  3.003e-01 -12.858  < 2e-16 ***
## reviews_per_month              -4.473e+00  2.920e-01 -15.321  < 2e-16 ***
## pack_n_play_travel_crib_amen    1.162e+01  1.481e+00   7.850 4.27e-15 ***
## record_count_c                  2.217e+00  2.872e-01   7.717 1.21e-14 ***
## price_mean_c                    1.076e+01  6.683e-01  16.104  < 2e-16 ***
## price_mean_zip                  1.236e+01  6.634e-01  18.625  < 2e-16 ***
## n_commas                        3.275e-01  2.772e-01   1.181 0.237501    
## w50                            -6.934e-02  2.704e-01  -0.256 0.797646    
## w127                            5.138e+00  2.641e-01  19.451  < 2e-16 ***
## w160                           -3.651e-01  2.615e-01  -1.396 0.162650    
## w194                            8.452e-01  2.675e-01   3.160 0.001578 ** 
## n_charsperword_2pow             6.999e-02  2.510e-01   0.279 0.780321    
## w163_2pow                      -1.331e-01  2.722e-01  -0.489 0.624822    
## cluster                         7.191e-01  6.132e-02  11.726  < 2e-16 ***
## price_mean_clus                 6.899e-01  3.688e-03 187.040  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.62 on 41299 degrees of freedom
## Multiple R-squared:  0.7671, Adjusted R-squared:  0.7669 
## F-statistic:  4533 on 30 and 41299 DF,  p-value: < 2.2e-16
pred <- predict(linear, newdata = testData)

# writeSubmit(pred)
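As a quick sanity check, a fitted `lm`'s in-sample RMSE can be read off its residuals. A toy sketch on a built-in dataset (the same one-liner applied to `linear` gives an in-sample figure, which will be optimistic compared with the hold-out RMSE on the test set):

```r
fit <- lm(dist ~ speed, data = cars)  # stand-in for the `linear` model above
sqrt(mean(residuals(fit)^2))          # in-sample RMSE, roughly 15 for this toy fit
```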

XGBoost

XGBoost has a great reputation on Kaggle, so we will see how it performs on our data set. Instead of grid search, we use Bayesian optimization to find the best hyperparameter set: it runs faster than an exhaustive grid search and yields better results than randomly sampling hyperparameter values.

# Create training matrix 
BoostTrainData <- xgb.DMatrix(model.matrix(price ~ ., data = trainData),
                              label = as.matrix(trainData %>% select(price)))
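Note that `model.matrix(price ~ ., ...)` drops the response and prepends an `(Intercept)` column of ones, so the DMatrix carries one more column than the number of predictors. A toy check:

```r
toy <- data.frame(price = c(1, 2), x = c(3, 4))
m <- model.matrix(price ~ ., data = toy)
colnames(m)  # "(Intercept)" "x"
```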

Make objective function

For Bayesian optimization, we need an objective function that lets the algorithm collect an error score for each candidate hyperparameter set. Each time it is called, the function runs XGBoost with 10-fold cross-validation on the supplied hyperparameters and returns the best test RMSE recorded for that specific set.

# objective function for Bayesian hyperparameter tuning
obj.fun <- makeSingleObjectiveFunction(
  # name of the objective function
  name = "xgb_cv_bayes",
  
  # the xgboost function 
  fn =   function(x) {
    set.seed(seed)
    print(x)
    cv <- xgb.cv(
      params = list(
        booster          = "gbtree",
        eta                    = x["eta"],
        max_depth              = x["max_depth"],
        min_child_weight       = x["min_child_weight"],
        gamma                  = x["gamma"],
        lambda                 = x["lambda"],
        alpha                  = x["alpha"],
        subsample              = x["subsample"],
        colsample_bytree       = x["colsample_bytree"],
        max_delta_step         = x["max_delta_step"],
        tweedie_variance_power = x["tweedie_variance_power"],
        objective              = 'reg:tweedie',
        eval_metric            = 'rmse'
      ),
      data = BoostTrainData,
      nrounds = 7000,
      nthread = cores,
      nfold =  10,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 5,
      verbose = 1,
      print_every_n = 500
    )
    cv$evaluation_log %>% pull(4) %>% min
  },
  
  # hyperparameters 
  par.set = makeParamSet(
    makeNumericParam("eta",                    lower = 0.005, upper = 0.36),
    makeNumericParam("gamma",                  lower = 1,     upper = 8),
    makeNumericParam("lambda",                 lower = 1,     upper = 8),
    makeNumericParam("alpha",                  lower = 1,     upper = 8),
    makeIntegerParam("max_depth",              lower = 2,     upper = 20),
    makeIntegerParam("min_child_weight",       lower = 1,     upper = 2000),
    makeNumericParam("subsample",              lower = 0.01,  upper = 1),
    makeNumericParam("colsample_bytree",       lower = 0.01,  upper = 1),
    makeNumericParam("max_delta_step",         lower = 0,     upper = 10),
    makeNumericParam("tweedie_variance_power", lower = 1,     upper = 2)
  ),
  
  # objective (minimizing rmse)
  minimize = TRUE
)

Make driver function

The driver function for Bayesian optimization fits a surrogate model of RMSE as a function of the tested hyperparameter sets. It then iterates (running XGBoost with new hyperparameter sets) to refine the surrogate and minimize the objective (test RMSE), and finally returns the best performing hyperparameter set.

# Driver function 
do_bayes <-
  function(n_design = NULL,
           opt_steps = NULL,
           of = obj.fun,
           seed = seed) {
    set.seed(seed)
    
    des <- generateDesign(n = n_design,
                          par.set = getParamSet(of),
                          fun = lhs::randomLHS)
    
    control <-
      makeMBOControl() %>% setMBOControlTermination(., iters = opt_steps)
    
    # modeling rmse from hyperparameters (actual driver call)
    run <- mbo(
      fun = of,
      design = des,
      learner = makeLearner(
        "regr.km",
        predict.type = "se",
        covtype = "matern3_2",
        control = list(trace = FALSE)
      ),
      control = control,
      show.info = TRUE
    )
    
    # plotting the Bayesian optimization result
    opt_plot <- run$opt.path$env$path %>%
      mutate(Round = row_number()) %>%
      mutate(type = case_when(Round <= n_design ~ "Design",
                              TRUE ~ "mlrMBO optimization")) %>%
      ggplot(aes(x = Round, y = y, color = type)) +
      geom_point() +
      labs(title = "mlrMBO optimization") +
      ylab("test RMSE")
    
    return(list(run = run, plot = opt_plot))
  }

Run Bayes Run

Enough talking: the following code runs Bayesian optimization to tune the XGBoost model. Once tuning finishes, the best performing hyperparameter set will be used to fit the final model for submission.

# Let's go!!! 
# 15 initial design runs build the surrogate model, followed by 5 optimization steps; the best hyperparameter set is then returned (increase these numbers to yield a better result)
runs <-
  do_bayes(
    n_design = 15, # 500
    of = obj.fun,
    opt_steps = 5, # 1000
    seed = seed
  )
## Computing y column(s) for design. Not provided.
##                    eta                  gamma                 lambda 
##              0.1433897              2.1655127              3.0476909 
##                  alpha              max_depth       min_child_weight 
##              3.4483429             18.0000000           1252.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.5097477              0.4863149              4.9631787 
## tweedie_variance_power 
##              1.4811994 
## [1]  train-rmse:175.849486+0.346263  test-rmse:175.822342+3.113339 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [80] train-rmse:38.942908+0.298351   test-rmse:47.661586+0.920629
## 
##                    eta                  gamma                 lambda 
##             0.34822109             7.69848880             5.96999307 
##                  alpha              max_depth       min_child_weight 
##             4.90819985             3.00000000           605.00000000 
##              subsample       colsample_bytree         max_delta_step 
##             0.05886207             0.68459279             5.60849741 
## tweedie_variance_power 
##             1.81771245 
## [1]  train-rmse:175.778471+0.346324  test-rmse:175.751070+3.113864 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [130]    train-rmse:57.547262+0.699675   test-rmse:57.725537+2.503902
## 
##                    eta                  gamma                 lambda 
##           1.990936e-02           1.552453e+00           7.936367e+00 
##                  alpha              max_depth       min_child_weight 
##           6.481732e+00           6.000000e+00           1.950000e+03 
##              subsample       colsample_bytree         max_delta_step 
##           3.645677e-01           3.089692e-01           4.583122e+00 
## tweedie_variance_power 
##           1.572117e+00 
## [1]  train-rmse:175.969369+0.346170  test-rmse:175.942174+3.112450 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## [501]    train-rmse:49.610105+0.111876   test-rmse:50.899946+1.257388 
## [1001]   train-rmse:46.847943+0.104132   test-rmse:49.344957+0.970688 
## Stopping. Best iteration:
## [1059]   train-rmse:46.634659+0.106617   test-rmse:49.258266+0.969962
## 
##                    eta                  gamma                 lambda 
##             0.07220784             7.21247581             4.59720215 
##                  alpha              max_depth       min_child_weight 
##             1.12367498             5.00000000            65.00000000 
##              subsample       colsample_bytree         max_delta_step 
##             0.29149216             0.44326297             2.60879215 
## tweedie_variance_power 
##             1.72679163 
## [1]  train-rmse:175.942735+0.346202  test-rmse:175.915547+3.112651 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [465]    train-rmse:47.670128+0.098051   test-rmse:49.052876+1.013779
## 
##                    eta                  gamma                 lambda 
##              0.2718219              2.5500253              5.6599657 
##                  alpha              max_depth       min_child_weight 
##              5.9559969             14.0000000            729.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.7628510              0.9775092              1.4348977 
## tweedie_variance_power 
##              1.7715245 
## [1]  train-rmse:175.819989+0.346289  test-rmse:175.792921+3.113554 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [57] train-rmse:44.483513+0.144819   test-rmse:47.899496+0.863629
## 
##                    eta                  gamma                 lambda 
##              0.2236498              6.5824342              5.0139707 
##                  alpha              max_depth       min_child_weight 
##              1.7070139              8.0000000            518.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.1660301              0.2270238              0.7804302 
## tweedie_variance_power 
##              1.8781448 
## [1]  train-rmse:175.908124+0.346218  test-rmse:175.881842+3.112901 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [137]    train-rmse:55.744936+0.442474   test-rmse:55.977466+1.493213
## 
##                    eta                  gamma                 lambda 
##              0.1584083              6.7180503              7.1675856 
##                  alpha              max_depth       min_child_weight 
##              2.7147090             19.0000000            227.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.5612247              0.7629903              3.2139675 
## tweedie_variance_power 
##              1.3945905 
## [1]  train-rmse:175.793785+0.346342  test-rmse:175.766392+3.113746 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [55] train-rmse:28.004347+0.132317   test-rmse:47.451966+1.311953
## 
##                    eta                  gamma                 lambda 
##              0.2010198              4.7151822              6.9399174 
##                  alpha              max_depth       min_child_weight 
##              7.8536417             16.0000000            987.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9655378              0.9049767              9.7988411 
## tweedie_variance_power 
##              1.4610456 
## [1]  train-rmse:175.773486+0.346340  test-rmse:175.745911+3.113904 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [43] train-rmse:34.270440+0.184371   test-rmse:46.353579+0.997256
## 
##                    eta                  gamma                 lambda 
##              0.3273010              5.3999300              6.3863042 
##                  alpha              max_depth       min_child_weight 
##              3.8272900              2.0000000           1387.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.8150512              0.1969072              6.7685915 
## tweedie_variance_power 
##              1.3202279 
## [1]  train-rmse:175.306444+0.346752  test-rmse:175.278867+3.117355 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [177]    train-rmse:48.534345+0.733089   test-rmse:50.266158+1.156792
## 
##                    eta                  gamma                 lambda 
##              0.3019416              4.1090816              3.4134593 
##                  alpha              max_depth       min_child_weight 
##              2.2287592             17.0000000            364.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.6303300              0.6679775              0.6449322 
## tweedie_variance_power 
##              1.0484468 
## [1]  train-rmse:175.899840+0.346229  test-rmse:175.872408+3.112977 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [45] train-rmse:17.006546+0.619475   test-rmse:52.741373+1.020810
## 
##                    eta                  gamma                 lambda 
##             0.03475413             5.77016770             2.21008797 
##                  alpha              max_depth       min_child_weight 
##             5.32484333            15.00000000           905.00000000 
##              subsample       colsample_bytree         max_delta_step 
##             0.68109201             0.03669035             8.31438419 
## tweedie_variance_power 
##             1.99120712 
## [1]  train-rmse:175.969275+0.346163  test-rmse:175.942012+3.112453 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## [501]    train-rmse:66.110576+1.265927   test-rmse:66.360620+3.592282 
## [1001]   train-rmse:58.655420+0.358635   test-rmse:58.941298+1.882641 
## Stopping. Best iteration:
## [1153]   train-rmse:58.262771+0.213446   test-rmse:58.538830+1.784587
## 
##                    eta                  gamma                 lambda 
##           9.790842e-02           3.744923e+00           4.114621e+00 
##                  alpha              max_depth       min_child_weight 
##           7.043411e+00           1.200000e+01           1.144000e+03 
##              subsample       colsample_bytree         max_delta_step 
##           9.819786e-02           9.892343e-02           6.013350e+00 
## tweedie_variance_power 
##           1.166854e+00 
## [1]  train-rmse:175.682570+0.346422  test-rmse:175.655051+3.114572 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [164]    train-rmse:46.865919+0.539149   test-rmse:54.303232+1.403258
## 
##                    eta                  gamma                 lambda 
##              0.1912814              1.0029404              2.4336267 
##                  alpha              max_depth       min_child_weight 
##              4.3432986             10.0000000           1832.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.2721438              0.3707872              8.7162563 
## tweedie_variance_power 
##              1.1099212 
## [1]  train-rmse:174.330325+0.347725  test-rmse:174.302690+3.124370 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [32] train-rmse:43.197246+0.735794   test-rmse:51.808310+1.724017
## 
##                    eta                  gamma                 lambda 
##              0.1176200              2.9306495              1.6517395 
##                  alpha              max_depth       min_child_weight 
##              2.9716598             10.0000000           1653.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.4102853              0.5529085              7.3697527 
## tweedie_variance_power 
##              1.6446855 
## [1]  train-rmse:175.905284+0.346218  test-rmse:175.878613+3.112913 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [233]    train-rmse:44.555446+0.196047   test-rmse:48.913560+0.908900
## 
##                    eta                  gamma                 lambda 
##              0.2487132              4.7803367              1.0335552 
##                  alpha              max_depth       min_child_weight 
##              7.4775389              8.0000000           1576.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9328213              0.8088376              3.3993612 
## tweedie_variance_power 
##              1.2532554 
## [1]  train-rmse:175.467448+0.346584  test-rmse:175.440715+3.116189 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [32] train-rmse:36.999666+0.303130   test-rmse:46.991657+1.150315
## [mbo] 0: eta=0.143; gamma=2.17; lambda=3.05; alpha=3.45; max_depth=18; min_child_weight=1252; subsample=0.51; colsample_bytree=0.486; max_delta_step=4.96; tweedie_variance_power=1.48 : y = 47.7 : 29.5 secs : initdesign
## [mbo] 0: eta=0.348; gamma=7.7; lambda=5.97; alpha=4.91; max_depth=3; min_child_weight=605; subsample=0.0589; colsample_bytree=0.685; max_delta_step=5.61; tweedie_variance_power=1.82 : y = 57.7 : 10.7 secs : initdesign
## [mbo] 0: eta=0.0199; gamma=1.55; lambda=7.94; alpha=6.48; max_depth=6; min_child_weight=1950; subsample=0.365; colsample_bytree=0.309; max_delta_step=4.58; tweedie_variance_power=1.57 : y = 49.3 : 137.8 secs : initdesign
## [mbo] 0: eta=0.0722; gamma=7.21; lambda=4.6; alpha=1.12; max_depth=5; min_child_weight=65; subsample=0.291; colsample_bytree=0.443; max_delta_step=2.61; tweedie_variance_power=1.73 : y = 49.1 : 56.9 secs : initdesign
## [mbo] 0: eta=0.272; gamma=2.55; lambda=5.66; alpha=5.96; max_depth=14; min_child_weight=729; subsample=0.763; colsample_bytree=0.978; max_delta_step=1.43; tweedie_variance_power=1.77 : y = 47.9 : 24.5 secs : initdesign
## [mbo] 0: eta=0.224; gamma=6.58; lambda=5.01; alpha=1.71; max_depth=8; min_child_weight=518; subsample=0.166; colsample_bytree=0.227; max_delta_step=0.78; tweedie_variance_power=1.88 : y = 56 : 15.5 secs : initdesign
## [mbo] 0: eta=0.158; gamma=6.72; lambda=7.17; alpha=2.71; max_depth=19; min_child_weight=227; subsample=0.561; colsample_bytree=0.763; max_delta_step=3.21; tweedie_variance_power=1.39 : y = 47.5 : 37.0 secs : initdesign
## [mbo] 0: eta=0.201; gamma=4.72; lambda=6.94; alpha=7.85; max_depth=16; min_child_weight=987; subsample=0.966; colsample_bytree=0.905; max_delta_step=9.8; tweedie_variance_power=1.46 : y = 46.4 : 24.7 secs : initdesign
## [mbo] 0: eta=0.327; gamma=5.4; lambda=6.39; alpha=3.83; max_depth=2; min_child_weight=1387; subsample=0.815; colsample_bytree=0.197; max_delta_step=6.77; tweedie_variance_power=1.32 : y = 50.3 : 12.1 secs : initdesign
## [mbo] 0: eta=0.302; gamma=4.11; lambda=3.41; alpha=2.23; max_depth=17; min_child_weight=364; subsample=0.63; colsample_bytree=0.668; max_delta_step=0.645; tweedie_variance_power=1.05 : y = 52.7 : 18.9 secs : initdesign
## [mbo] 0: eta=0.0348; gamma=5.77; lambda=2.21; alpha=5.32; max_depth=15; min_child_weight=905; subsample=0.681; colsample_bytree=0.0367; max_delta_step=8.31; tweedie_variance_power=1.99 : y = 58.5 : 112.8 secs : initdesign
## [mbo] 0: eta=0.0979; gamma=3.74; lambda=4.11; alpha=7.04; max_depth=12; min_child_weight=1144; subsample=0.0982; colsample_bytree=0.0989; max_delta_step=6.01; tweedie_variance_power=1.17 : y = 54.3 : 27.4 secs : initdesign
## [mbo] 0: eta=0.191; gamma=1; lambda=2.43; alpha=4.34; max_depth=10; min_child_weight=1832; subsample=0.272; colsample_bytree=0.371; max_delta_step=8.72; tweedie_variance_power=1.11 : y = 51.8 : 8.2 secs : initdesign
## [mbo] 0: eta=0.118; gamma=2.93; lambda=1.65; alpha=2.97; max_depth=10; min_child_weight=1653; subsample=0.41; colsample_bytree=0.553; max_delta_step=7.37; tweedie_variance_power=1.64 : y = 48.9 : 53.7 secs : initdesign
## [mbo] 0: eta=0.249; gamma=4.78; lambda=1.03; alpha=7.48; max_depth=8; min_child_weight=1576; subsample=0.933; colsample_bytree=0.809; max_delta_step=3.4; tweedie_variance_power=1.25 : y = 47 : 10.4 secs : initdesign
##                    eta                  gamma                 lambda 
##              0.1798297              2.9852834              5.7587270 
##                  alpha              max_depth       min_child_weight 
##              6.5558814             10.0000000           1206.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.7898417              0.9367973              3.3939614 
## tweedie_variance_power 
##              1.4980765 
## [1]  train-rmse:175.815965+0.346299  test-rmse:175.788626+3.113577 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [61] train-rmse:38.718709+0.254796   test-rmse:46.401449+1.123193
## [mbo] 1: eta=0.18; gamma=2.99; lambda=5.76; alpha=6.56; max_depth=10; min_child_weight=1206; subsample=0.79; colsample_bytree=0.937; max_delta_step=3.39; tweedie_variance_power=1.5 : y = 46.4 : 22.6 secs : infill_cb
##                    eta                  gamma                 lambda 
##              0.1541928              4.3861109              5.5257429 
##                  alpha              max_depth       min_child_weight 
##              5.0235506             19.0000000           1264.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9086676              0.7783669              4.1549332 
## tweedie_variance_power 
##              1.3475073 
## [1]  train-rmse:175.769217+0.346364  test-rmse:175.742021+3.113918 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [53] train-rmse:28.683528+0.132036   test-rmse:46.496345+1.277794
## [mbo] 2: eta=0.154; gamma=4.39; lambda=5.53; alpha=5.02; max_depth=19; min_child_weight=1264; subsample=0.909; colsample_bytree=0.778; max_delta_step=4.15; tweedie_variance_power=1.35 : y = 46.5 : 33.2 secs : infill_cb
##                    eta                  gamma                 lambda 
##              0.1846294              3.8622876              2.0013141 
##                  alpha              max_depth       min_child_weight 
##              7.7323198             17.0000000            539.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9134028              0.8936029              4.5158724 
## tweedie_variance_power 
##              1.3751894 
## [1]  train-rmse:175.739752+0.346364  test-rmse:175.712285+3.114134 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [45] train-rmse:23.743065+0.272348   test-rmse:46.974959+1.235782
## [mbo] 3: eta=0.185; gamma=3.86; lambda=2; alpha=7.73; max_depth=17; min_child_weight=539; subsample=0.913; colsample_bytree=0.894; max_delta_step=4.52; tweedie_variance_power=1.38 : y = 47 : 31.5 secs : infill_cb
##                    eta                  gamma                 lambda 
##              0.1235886              3.0338489              6.2845442 
##                  alpha              max_depth       min_child_weight 
##              3.1953927             19.0000000            708.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9193236              0.8764768              2.7139873 
## tweedie_variance_power 
##              1.5308240 
## [1]  train-rmse:175.882330+0.346252  test-rmse:175.854909+3.113097 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [88] train-rmse:32.019518+0.194947   test-rmse:45.989405+0.970723
## [mbo] 4: eta=0.124; gamma=3.03; lambda=6.28; alpha=3.2; max_depth=19; min_child_weight=708; subsample=0.919; colsample_bytree=0.876; max_delta_step=2.71; tweedie_variance_power=1.53 : y = 46 : 52.0 secs : infill_cb
##                    eta                  gamma                 lambda 
##              0.1394113              3.1417033              7.9378679 
##                  alpha              max_depth       min_child_weight 
##              6.7552343              8.0000000            431.0000000 
##              subsample       colsample_bytree         max_delta_step 
##              0.9315066              0.8207677              4.1949087 
## tweedie_variance_power 
##              1.6853234 
## [1]  train-rmse:175.896364+0.346238  test-rmse:175.868819+3.112994 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [143]    train-rmse:40.871966+0.107210   test-rmse:46.315541+0.825370
## [mbo] 5: eta=0.139; gamma=3.14; lambda=7.94; alpha=6.76; max_depth=8; min_child_weight=431; subsample=0.932; colsample_bytree=0.821; max_delta_step=4.19; tweedie_variance_power=1.69 : y = 46.3 : 38.4 secs : infill_cb
plot(runs$run)
## Loading required package: grid
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

best.params <- runs$run$x

# run cross-validation with the best hyperparameter set
set.seed(seed)
optimal.cv <- xgb.cv(
  params = best.params,
  data = BoostTrainData,
  nrounds = 7000,
  nthread = cores,
  nfold = 10,
  prediction = FALSE,
  showsd = TRUE,
  early_stopping_rounds = 5,
  verbose = 1,
  print_every_n = 100, 
  objective = 'reg:tweedie',
  eval_metric = 'rmse'
)
## [1]  train-rmse:175.882329+0.346253  test-rmse:175.854912+3.113096 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
## 
## Stopping. Best iteration:
## [88] train-rmse:32.019517+0.194948   test-rmse:45.989404+0.970722
# make the final model
set.seed(seed)
model <-
  xgboost(
    params = best.params,
    data = BoostTrainData,
    nrounds = optimal.cv$best_ntreelimit
  )
## [21:22:39] WARNING: amalgamation/../src/learner.cc:573: 
## Parameters: { "tweedie_variance_power" } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
## 
## 
## [1]  train-rmse:175.723572 
## ... (iterations 2-87 elided; train RMSE decreases steadily from 175.5 to 154.4) ...
## [88] train-rmse:154.216431
# take a peek
summary(model)
##                Length Class              Mode       
## handle             1  xgb.Booster.handle externalptr
## raw            40092  -none-             raw        
## niter              1  -none-             numeric    
## evaluation_log     2  data.table         list       
## call              13  -none-             call       
## params            11  -none-             list       
## callbacks          2  -none-             list       
## feature_names     31  -none-             character  
## nfeatures          1  -none-             numeric
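Beyond the object summary, it can be useful to see which features actually drive the boosted model. A minimal sketch using xgboost's built-in importance helpers (assuming the fitted `model` from above):

```r
# gain-based importance of each feature in the fitted booster
importance <- xgb.importance(model = model)
head(importance, 10)                        # top features by gain
xgb.plot.importance(importance, top_n = 10) # quick bar chart of the same
```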

Write submission

# predict on the test set; price is stubbed with -1 so that model.matrix()
# produces the same design-matrix columns as in training
pred <-
  predict(model, model.matrix(price ~ ., testData %>% mutate(price = -1)) %>% xgb.DMatrix())

# make submission csv
# writeSubmit(pred)
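`writeSubmit()` comes from the sourced helper script, so its body is not shown here. A hypothetical stand-in (the `id` and `price` column names are assumptions; check the competition's sample submission file) might look like:

```r
# hypothetical sketch of the sourced writeSubmit() helper:
# pair each test row with its predicted price and write a Kaggle-style csv
writeSubmitSketch <- function(pred, path = "submission.csv") {
  submission <- data.frame(id = seq_along(pred), price = pred)
  write.csv(submission, path, row.names = FALSE)
}
```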

ANN

This one turned out to be artificial stupidity rather than artificial intelligence. (Such a shame.)

Build Net

NetTrainX <- trainData %>% select(-price) %>% as.matrix()
NetTrainY <- trainData %>% select(price) %>% as.matrix()

inputSize <- dim(NetTrainX)[2]  # number of predictors
offsetSize <- 0.45              # base layer width as a fraction of the input size
dropout <- 0.5                  # dropout rate between the wide hidden layers
activate <- "relu"

scaleSize <- inputSize * offsetSize # fractional widths are truncated to integers by Keras

# hourglass architecture: widen to 8x the base width, then taper back down to a
# single output; the relu output keeps predicted prices non-negative
model <- keras_model_sequential() %>%
  layer_dense(units = scaleSize*2, activation = activate, input_shape = dim(NetTrainX)[2]) %>%
  layer_dropout(rate = dropout) %>%
  layer_dense(units = scaleSize*4, activation = activate) %>%
  layer_dropout(rate = dropout) %>%
  layer_dense(units = scaleSize*8, activation = activate) %>%
  layer_dropout(rate = dropout) %>%
  layer_dense(units = scaleSize*4, activation = activate) %>%
  layer_dropout(rate = dropout) %>%
  layer_dense(units = scaleSize*2, activation = activate) %>%
  layer_dropout(rate = dropout) %>%
  layer_dense(units = scaleSize*0.5, activation = activate) %>%
  layer_dense(units = scaleSize*0.25, activation = activate) %>%
  layer_dense(units = 1, activation = activate)

model %>% compile(
  loss = "mse",
  optimizer = "nadam",
  metrics = c("mape", "mse")
)
 
model %>% summary()
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_7 (Dense)                     (None, 27)                      837         
## ________________________________________________________________________________
## dropout_4 (Dropout)                 (None, 27)                      0           
## ________________________________________________________________________________
## dense_6 (Dense)                     (None, 54)                      1512        
## ________________________________________________________________________________
## dropout_3 (Dropout)                 (None, 54)                      0           
## ________________________________________________________________________________
## dense_5 (Dense)                     (None, 108)                     5940        
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 108)                     0           
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 54)                      5886        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 54)                      0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 27)                      1485        
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 27)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 6)                       168         
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 3)                       21          
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       4           
## ================================================================================
## Total params: 15,853
## Trainable params: 15,853
## Non-trainable params: 0
## ________________________________________________________________________________

Train Net

set.seed(seed) # note: set.seed() alone does not seed TensorFlow, so runs may still differ
model %>% fit(
  NetTrainX,
  NetTrainY,
  epochs = 20, # increase to train longer
  batch_size = 128,
  validation_split = 0.2,
  verbose = 2
)
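A fixed epoch count risks under- or over-training; one option is to let Keras stop on the validation loss instead. A sketch using the R keras package's built-in early-stopping callback (the patience value is an arbitrary choice):

```r
model %>% fit(
  NetTrainX,
  NetTrainY,
  epochs = 200, # upper bound; early stopping decides the actual length
  batch_size = 128,
  validation_split = 0.2,
  callbacks = list(
    # halt once val_loss stops improving and roll back to the best weights
    callback_early_stopping(monitor = "val_loss", patience = 10,
                            restore_best_weights = TRUE)
  ),
  verbose = 2
)
```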

scores <- model %>% evaluate(NetTrainX, NetTrainY, verbose = 0)
print(scores)
##       loss       mape        mse 
##   17233.57 9187553.00   17233.57
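Since the competition scores RMSE while the net optimizes MSE, the reported loss converts directly by taking a square root (the enormous MAPE also hints that some target values may sit at or near zero, which is worth checking):

```r
# RMSE is the square root of the MSE loss reported by evaluate()
sqrt(17233.57) # roughly 131.3, far behind the ~46 test RMSE of the XGBoost model above
```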

Write submission

pred <- predict(model, testData %>% as.matrix())

# writeSubmit(pred)

Conclusion

In conclusion, there is still plenty of room for improvement, such as outlier handling, better NA-value handling, or a better clustering method. Given more time, it might be worth manually implementing the ensemble learning process for XGBoost and mixing it with random forest (just a thought; it might take years to get done). Beyond that, the current model (the one that produces the best RMSE on Kaggle) takes too long to run (about 3 days), and most of that time is spent on feature selection, finding the optimal number of clusters, and tuning XGBoost.